feat(core): GPU instancing auto-batching #2957
…nce data
Introduce automatic GPU instancing for MeshRenderer. The system scans renderer-group uniforms across shader passes, builds a unified std140 UBO layout, and packs per-instance data (ModelMat, Layer, etc.) each frame.
Key changes:
- InstanceDataPacker: packs renderer data into a shared UBO for instanced draws
- ShaderFactory: unified _scanInstanceUniforms, _buildLayout, _injectInstanceUBO
- MeshRenderer._canBatch/_batch: instancing merge logic
- ShaderPass/SubShader: instance-aware compilation with macro cache
- GLSLIfdefResolver: compile-time #ifdef resolution for instance field scanning
- MacroCachePool: pooled ShaderMacroCollection for shader program caching
- RenderQueue: instance-aware draw path with UBO binding
Walkthrough
Adds WebGL2 GPU instancing: instance UBO management, instance-aware shader compilation and caching, batching API/signature changes, render-queue instanced draw paths, device/GL helpers, examples and E2E tests, and supporting shader/uniform enhancements.
Sequence Diagram(s)
sequenceDiagram
participant RenderQueue as RenderQueue
participant BatcherManager as BatcherManager
participant InstanceBatch as InstanceBatch
participant GPUBuffer as GPU Constant Buffer
participant ShaderProgram as ShaderProgram
RenderQueue->>BatcherManager: request instanceBatch (lazy)
BatcherManager->>InstanceBatch: setLayout(layout)
InstanceBatch->>GPUBuffer: create/realloc UBO (if needed)
loop per instanced chunk
RenderQueue->>InstanceBatch: upload(renderers[], start, count)
InstanceBatch->>InstanceBatch: pack per-instance fields into CPU buffer
InstanceBatch->>GPUBuffer: setData(range, Discard)
end
RenderQueue->>ShaderProgram: bindUniformBlocks(bindingMap)
ShaderProgram->>GPUBuffer: uniformBlockBinding(bindingPoint)
RenderQueue->>RenderQueue: issue drawPrimitive with instanceCount
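The "pack per-instance fields into CPU buffer" step in the diagram can be sketched as below. This is a minimal illustration, not the engine's InstanceBatch: the `Instance` record, `STRIDE_BYTES`, and `packInstances` are hypothetical names, and the layout (one vec4 followed by one int, padded to 32 bytes) is an assumed example. It shows how a float view and an int view over the same ArrayBuffer let mixed-type fields be written at std140 offsets.

```typescript
// Hypothetical per-instance record: a vec4 color followed by an int layer.
interface Instance {
  color: [number, number, number, number];
  layer: number;
}

// Assumed std140 stride: 16 bytes (vec4) + 4 bytes (int), rounded up to a multiple of 16.
const STRIDE_BYTES = 32;

// Packs instances[start .. start + count) into a fresh CPU-side buffer.
function packInstances(instances: Instance[], start: number, count: number): ArrayBuffer {
  const buffer = new ArrayBuffer(count * STRIDE_BYTES);
  // Two typed views over the same memory: floats for vec4, ints for the layer.
  const floatView = new Float32Array(buffer);
  const intView = new Int32Array(buffer);
  for (let i = 0; i < count; i++) {
    const base = (i * STRIDE_BYTES) >> 2; // index in 4-byte elements
    floatView.set(instances[start + i].color, base);
    intView[base + 4] = instances[start + i].layer; // byte offset 16 within the struct
  }
  return buffer;
}
```

The returned buffer is what a real implementation would hand to the GPU upload call for the chunk being drawn.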
Estimated code review effort: 4 (Complex), ~60 minutes
Move _gpuInstanceMacro after _macroMap declaration to fix static initialization order. Also apply prettier formatting fixes.
Codecov Report
@@ Coverage Diff @@
## dev/2.0 #2957 +/- ##
===========================================
- Coverage 77.38% 76.63% -0.76%
===========================================
Files 900 907 +7
Lines 98752 99573 +821
Branches 9817 9819 +2
===========================================
- Hits 76415 76303 -112
- Misses 22170 23097 +927
- Partials 167 173 +6
…es layout
The instance UBO is injected at compile time by ShaderFactory, so the original shader uniform declarations don't need modification.
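To make the compile-time injection concrete, here is a rough sketch of how a GLSL prelude could be generated from a field list. It is an assumption-laden illustration, not ShaderFactory's actual output: `buildInstanceUBO`, the `InstanceData`/`InstanceBlock`/`instanceData` names, and the `inst_` member prefix are all hypothetical. It declares the struct outside the uniform block and remaps each original uniform name to an array access with `#define`, so shader code that reads the uniform by name keeps compiling unchanged.

```typescript
// Hypothetical generator for the injected GLSL prelude (vertex-shader side only,
// since gl_InstanceID is a vertex-shader built-in).
function buildInstanceUBO(fields: Array<[string, string]>, maxCount: number): string {
  // One struct member per renderer uniform; the inst_ prefix is an illustrative choice.
  const members = fields.map(([name, type]) => `  ${type} inst_${name};`).join("\n");
  // Each original uniform name becomes a macro that indexes the instance array.
  const defines = fields
    .map(([name]) => `#define ${name} instanceData[gl_InstanceID].inst_${name}`)
    .join("\n");
  return [
    "struct InstanceData {",
    members,
    "};",
    `layout(std140) uniform InstanceBlock { InstanceData instanceData[${maxCount}]; };`,
    defines
  ].join("\n");
}
```

A fragment shader would need the instance index passed in separately, for example through a flat varying.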
Bring back normalMat extraction, transform_declare reordering, trailing whitespace fixes, and VertexPBR indent fix. Only the renderer_Layer relocation stays reverted.
… 3×vec4
- Fix SubRenderElement.set() not resetting instanceDataPacker, causing stale packer references from previous frames to break all batching
- Use whitelist + _group fallback for identifying renderer uniforms in _scanInstanceUniforms (fixes _group===undefined for ModelMat)
- Store ModelMat as 3×vec4 (affine rows) instead of mat4 in UBO, saving 16 bytes per instance (structSize 80→64, +25% instances/batch)
- Add camera_VPMat to transform_declare.glsl for derived MVP define
- Extract struct definition outside uniform block (GLSL ES 3.00 compat)
- Fix _insertUBOBlock to only scan initial #define preamble
- Pass instanceID to fragment shader via flat varying
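The 3×vec4 storage and the "+25% instances/batch" figure can be checked with a short sketch. It assumes a column-major 4×4 matrix whose bottom row is (0, 0, 0, 1) and a 16 KiB uniform block, which is the minimum MAX_UNIFORM_BLOCK_SIZE that WebGL2 guarantees; the helper name `packAffineRows` and the constants are hypothetical, not taken from the PR.

```typescript
// Hypothetical helper: writes the three affine rows of a column-major mat4
// into `out` starting at float index `offset` (12 floats = 48 bytes).
function packAffineRows(m: ArrayLike<number>, out: Float32Array, offset: number): void {
  for (let row = 0; row < 3; row++) {
    for (let col = 0; col < 4; col++) {
      // Column-major storage: element (row, col) lives at index col * 4 + row.
      out[offset + row * 4 + col] = m[col * 4 + row];
    }
  }
}

// Struct sizes come from the commit message; the UBO size is an assumption.
const maxUBOSize = 16384;
const instancesWithMat4 = Math.floor(maxUBOSize / 80); // struct with a full mat4
const instancesWithRows = Math.floor(maxUBOSize / 64); // struct with 3 x vec4
```

Under these assumptions the capacity goes from 204 to 256 instances per batch, which matches the roughly 25% gain quoted above. On the shader side, the matrix can be rebuilt by constructing a mat4 from the three rows plus (0, 0, 0, 1) and transposing it.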
…lity
Wrap derived NormalMat define with mat4() so instancing and non-instancing paths both produce mat4, avoiding shader compilation errors. Add custom instance data example to verify per-renderer uniform batching.
Rename elementA/elementB to preSubElement/subElement across all renderer subclasses and BatchUtils. Change _batch signature so preSubElement is nullable (null = batch head, no previous element to merge with), and subElement is always required.
- Move RENDERER_GPU_INSTANCE macro from ShaderMacro to InstanceDataPacker
- Rename getOrCreate() to get() in InstanceDataPackerPool
- Clear compileMacros in InstanceDataPacker.reset()
Move macro merging, layout computation, and UBO packing from batch phase to render phase. _batch now only collects renderers into a pre-allocated list. RenderQueue.render handles macro union, layout lookup, and splits by maxInstanceCount for sub-batch rendering.
- SubRenderElement: instanceDataPacker → instancedRenderers (pooled array)
- MeshRenderer._canBatch: remove maxInstanceCount check
- MeshRenderer._batch: only push renderers, zero allocation
- InstanceDataPacker: remove compileMacros/addRenderer/instanceCount, add packAndUpload(renderers, start, count)
- InstanceDataPackerPool: remove uploadBuffer, simplify reset
- BatcherManager: remove instancing uploadBuffer call
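The split by maxInstanceCount described above amounts to walking the collected list in fixed-size chunks. A minimal sketch, assuming a callback stands in for the pack-upload-draw sequence; `forEachSubBatch` and its parameters are hypothetical names rather than the RenderQueue API:

```typescript
// Hypothetical sketch: visits [0, total) in chunks of at most maxInstanceCount.
function forEachSubBatch(
  total: number,
  maxInstanceCount: number,
  draw: (start: number, count: number) => void
): void {
  if (maxInstanceCount <= 0) {
    throw new RangeError("maxInstanceCount must be positive");
  }
  for (let start = 0; start < total; start += maxInstanceCount) {
    // The last chunk may be shorter than maxInstanceCount.
    const count = Math.min(maxInstanceCount, total - start);
    draw(start, count); // e.g. packAndUpload(renderers, start, count), then an instanced draw
  }
}
```

Because the batch phase only appends renderers to a list, the chunk boundaries can be decided here, once the per-pass layout and its instance capacity are known.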
Packer is now stateless (setLayout + packAndUpload + draw), so only one instance is needed. Discard upload ensures no GPU stall when reusing the same buffer across shadow and main passes.
- Delete InstanceDataPackerPool.ts
- BatcherManager: instanceDataPackerPool → instanceDataPacker
- Remove resetInstanceDataPackerPool lifecycle
- Saves GPU memory by using single buffer instead of pool
Rename class, file, and all references to better reflect its role as an instance batch manager rather than a generic data packer.
It's a macro-keyed map, not a pool (no borrow/return semantics).
Align variable and method names with MacroMap rename — these are maps, not pools.
Callers always pass a valid buffer, no need for null guard.
- nativeBuffer → buffer, _uboData → _data
- Replace separate instanceFields/_structSize with single _layout ref
- setLayout() now takes InstanceLayout directly
- Remove unnecessary null guard on _layout
- Inline uploadElements variable
- Destructure floatView/intView from this
- Improve worldMatrix comment
- Remove unnecessary component→renderer alias, use component directly
- Hoist bindUniformBlocks/bindUniformBufferBase out of sub-batch loop
- Move primitive.instanceCount=0 after loop (only need to reset once)
- Remove redundant let layout = undefined
- Upgrade lint-staged from v10.5 to v16.4.0
- Fix glob from *.{ts} to **/*.ts to match subdirectory files
- Remove redundant git add from tasks
- eslint 8.44 → 8.57
- @typescript-eslint/parser and eslint-plugin 6.x → 8.x
- Eliminates "unsupported TypeScript version" warning
Dispose means the object is permanently released, so null the array to free memory instead of just clearing length.
Eliminate redundant SubShader._getInstanceLayout / ShaderPass._scanInstanceFields by reusing the shader compilation chain — _injectInstanceUBO now returns InstanceLayout directly, stored on ShaderProgram._instanceLayout.
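Thanks for the detailed review. Replies item by item:

Review 1
- [P1] ShaderPass.ts:175 - isGPUInstance lacks a WebGL2 guard: Not needed. Upstream …
- [P1] MeshRenderer.ts:177-184 - _canBatch does not compare renderer-level shader state: Fixed. Added …
- [P1] ShaderFactory.ts:59-68 - renderer_LocalMat / renderer_MVInvMat are derived but have no #define: Both uniforms have been removed (commit 761bec4). No built-in engine shader uses these two variables, and Unity/Unreal/Godot have no corresponding built-in variables either. Removing them also saves one per-frame …
- [P1] RenderQueue.ts:116 - compileMacros is missing renderer-level macros: Does not exist.
- [P2] RenderQueue.ts:213-218 - mask state is not compared: Not needed.
- [P2] ShaderFactory.ts - mat3x4 affine assumption: No extra comment needed.
- [P2] noise_common.glsl leftover constants: Unrelated. This PR is about GPU instancing and does not touch any noise-related files, so it has no file overlap with #2960.
- [P2] SubRenderElement.ts dispose - NPE on pool reuse: Does not exist.
- [P2] GLBuffer.ts - ConstantBuffer has no WebGL2 check: Not needed. Same as the first P1 item: the RHI layer is internal code and trusts upstream constraints.
- [P3] ShaderProgramMap.ts - engine parameter is optional: Fixed (commit 91e2ce2). Changed to required.

Review 2
- [P0] SubRenderElement.ts:49 - NPE on pool reuse after dispose: Same as above, does not exist.
- [P1] NormalMat inverse() computed per vertex: There is a real GPU cost, but it is not something this PR needs to solve. On the non-instancing path NormalMat is precomputed on the CPU and uploaded; the instancing path changes this to …
- [P1] Removing the opaque distance sort affects overdraw: After detailed analysis, distance sorting does not need to be restored. Mobile TBDR architectures (Apple HSR, Mali FPK, Adreno LRZ) handle overdraw in hardware, so distance sorting brings no benefit for fragment cost. Vertex cost is the same whether or not objects are sorted (vertices of occluded objects still have to be processed). Batching-first sorting is the industry trend: Unity's opaque queue is also shader/material first, and Unreal/Godot likewise skip distance sorting on mobile.
- [P2] InstanceBatch wastes GPU memory for small batches: The current design is reasonable. The UBO buffer is reused globally (there is only one InstanceBatch); a buffer is not allocated per batch.
- [P2] _scanInstanceUniforms may falsely match comments: The probability is extremely low, and the PR description already plans a ShaderLab precompilation replacement. Non-blocking.
- [P2] renderer_MVInvMat is derived but has no #define: Removed (same as above).
- [P2] isWebGL2 is evaluated on every _canBatch call: The overhead is negligible.
- [P3] _derivedDefines formatting: Does not affect functionality; keeping it as is.

Replies to simplification suggestions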
…tin map
Log error when renderer-group array uniforms are found during instancing UBO injection; arrays are not supported (consistent with Unity/Unreal/Godot). Also replace Map with Record<string, boolean> for _builtinRendererUniforms.
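Replies to Review 5
- [P1] NormalMat per-vertex inverse(): Same as the reply to Review 2. This is a known trade-off: GPU ALU in exchange for UBO space. In most scenes, where there is no non-uniform scale, …
- [P2] _derivedDefines is injected into both VS and FS: The injection is needed. camera uniforms ( …
- [P2] Distance factor removed from opaque sorting: Same as the reply to Review 2. Mobile TBDR hardware handles overdraw, so distance sorting brings no benefit. Batching first is the industry trend.
- [P2] The worldMatrix getter in upload() triggers a lazy update: This is part of Transform's public contract, not an implicit dependency.
- [P2] Regex scanning in _scanInstanceUniforms: Same as the reply to Review 2. The ShaderLab precompilation path has structured metadata and will replace the regex approach.
- [P3] _builtinRendererUniforms Map: Fixed. The latest commit changes it to …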
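Replies to Review 6
All issues were covered in earlier replies; each is annotated below:
- [P1] NormalMat inverse() - already answered (Reviews 2, 5): Known trade-off, GPU ALU in exchange for UBO space. Can be optimized later.
- [P1] opaque distance sort - already answered (Reviews 2, 5): Mobile TBDR hardware handles overdraw; batching first is the industry trend.
- [P2] _derivedDefines injected into VS and FS - already answered (Review 5): camera uniforms are declared in both VS and FS (via …
- [P2] InstanceBatch setData parameter semantics - false positive
- [P3] _derivedDefines last line is missing \n - already answered (Review 2): Does not affect functionality; keeping it as is.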
}

/**
 * @internal
No need for @internal here, since this is not exported.
The exported interface further down should be moved closer to the top.
Done. Removed @internal and moved the exported interface up as well.
Top 1 - SubRenderElement ObjectPool NPE:
Frame 1: new SubRenderElement() → instancedRenderers = [] ← class field
Fix: …
Top 2 - _buildLayout crash when structSize=0:
fieldMap has entries, but none of their types are in _std140TypeInfoMap → addField skips all of them
Fix:
    if (instanceFields.length === 0) return null;
    const structSize = Math.ceil(currentOffset / 16) * 16;
    const instanceMaxCount = Math.floor(maxUBOSize / structSize);
    if (instanceMaxCount < 2) {
      Logger.warn("GPU Instancing: struct too large, falling back");
      return null;
    }
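For context, the guard above sits at the end of a std140 layout pass. The following is a self-contained sketch of such a pass under stated assumptions: `buildLayout`, `std140Info`, and the `Layout` shape are hypothetical stand-ins for the PR's _buildLayout and _std140TypeInfoMap, and only a handful of types are listed. Sizes and alignments follow the std140 rules (for example, a vec3 occupies 12 bytes but aligns to 16).

```typescript
interface FieldInfo { name: string; type: string; offset: number; }
interface Layout { fields: FieldInfo[]; structSize: number; instanceMaxCount: number; }

// std140 size and base alignment in bytes for a few scalar, vector and matrix types.
const std140Info: Record<string, { size: number; align: number }> = {
  float: { size: 4, align: 4 },
  int: { size: 4, align: 4 },
  vec2: { size: 8, align: 8 },
  vec3: { size: 12, align: 16 },
  vec4: { size: 16, align: 16 },
  mat4: { size: 64, align: 16 }
};

// Hypothetical layout pass: returns null when instancing should not be used.
function buildLayout(decls: Array<[string, string]>, maxUBOSize: number): Layout | null {
  const fields: FieldInfo[] = [];
  let offset = 0;
  for (const [name, type] of decls) {
    const info = std140Info[type];
    if (!info) continue; // unsupported type: skipped in this sketch
    offset = Math.ceil(offset / info.align) * info.align; // align the field start
    fields.push({ name, type, offset });
    offset += info.size;
  }
  if (fields.length === 0) return null; // nothing to pack, avoids structSize = 0
  const structSize = Math.ceil(offset / 16) * 16; // array stride is a multiple of 16
  const instanceMaxCount = Math.floor(maxUBOSize / structSize);
  if (instanceMaxCount < 2) return null; // struct too large to batch anything
  return { fields, structSize, instanceMaxCount };
}
```

Returning null for an empty field list keeps the division by structSize from ever seeing zero, which is the failure mode the comment describes.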
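Thanks for the detailed analysis. Replies item by item:
- Top 1 - SubRenderElement ObjectPool NPE: Does not exist.
- Top 2 - _buildLayout crash when structSize=0: Does not exist.

Reply to the automated CR (review based on 266c346):
- [P0] structSize=0 leads to Infinity: Does not exist.
- [P1] interface position: Fixed (de638a3). The exported …
- [P2] undeclared identifier after an array uniform is stripped: Known limitation. Leaving instancing entirely would require a complete fallback path in RenderQueue (per-renderer transform upload + uniform upload + draw), which is a large change that duplicates the normal draw logic. Current approach: custom array uniforms in renderer shaderData are not officially supported, …
- [P2] isInstanced and layout live at inconsistent layers: A theoretical hazard that is not triggered in practice.
- Simplification suggestion - _derivedDefines template literal: Can be changed, but low priority.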
GuoLei1990 left a comment
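Summary
This PR introduces automatic GPU instancing for MeshRenderer. The core architecture is clear: a UBO packs per-instance renderer data, and shader injection remaps uniforms to UBO array accesses. Several key design decisions are solid:
- mat3x4 affine optimization: relies on the architectural guarantee that ModelMat is always an affine transform, saving 16 bytes per instance and directly raising instanceMaxCount
- Independent per-pass layouts: each pass computes its struct size from its own uniform needs, so lightweight passes such as ShadowCaster get a higher instance capacity
- Macro-collection equality check: _canBatch uses macroCollection.isEqual() to cover all renderer-level macro differences systematically, avoiding per-property enumeration
- Buffer orphaning: SetDataOptions.Discard avoids GPU stalls, and partial uploads save bandwidth
- Sorting strategy: opaque objects are sorted by material/primitive so that batching takes priority, which suits mobile TBDR architectures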
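The macro-collection equality idea can be illustrated with a small compatibility check. This is a sketch under assumptions, not MeshRenderer._canBatch itself: the `BatchCandidate` shape, its id fields, and the bitmask representation of the macro set are hypothetical. It shows why comparing the whole macro mask covers every renderer-level macro at once instead of enumerating properties.

```typescript
// Hypothetical description of what two consecutive elements must share to be merged.
interface BatchCandidate {
  primitiveId: number;
  materialId: number;
  frontFaceInverted: boolean;
  macroMask: Uint32Array; // one bit per enabled shader macro
}

// Word-by-word comparison; missing trailing words are treated as zero.
function macroMasksEqual(a: Uint32Array, b: Uint32Array): boolean {
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const wordA = i < a.length ? a[i] : 0;
    const wordB = i < b.length ? b[i] : 0;
    if (wordA !== wordB) return false;
  }
  return true;
}

function canBatch(prev: BatchCandidate, cur: BatchCandidate): boolean {
  return (
    prev.primitiveId === cur.primitiveId &&
    prev.materialId === cur.materialId &&
    prev.frontFaceInverted === cur.frontFaceInverted &&
    macroMasksEqual(prev.macroMask, cur.macroMask)
  );
}
```

Any new renderer-level macro changes the mask, so it is picked up by the same comparison without touching the batching code.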
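Measurements on an iPhone 16 Pro Max show 30→50 FPS (+67%), which validates the design.
Most of the issues raised in earlier review rounds have been fixed or reasonably explained by the author. After verifying each of them, the following new findings remain: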
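Issues
P2
- [P2] ShaderFactory.ts:317 - _std140TypeInfoMap is missing the bool type. When a user sets a custom bool uniform in renderer shaderData, _scanInstanceUniforms removes that uniform declaration from the shader source, but _buildLayout.addField silently skips it because no type info is found (if (!info) return). The uniform then exists neither in the UBO nor as its original declaration, and shader compilation fails. No built-in engine shader currently has a renderer-group bool uniform on the MeshRenderer path, but user-defined shaders could trigger this. In std140, bool has the same layout as int (4 bytes, 4-byte alignment), so the suggestion is to add it:

    private static readonly _std140TypeInfoMap: Record<string, { size: number; align: number }> = {
      bool: { size: 4, align: 4 }, // ← added
      float: { size: 4, align: 4 },
      // ...
    };

  Also add a pack function for bool to _packFuncMap (same as int, using intView). Alternatively, if bool is not going to be supported, _scanInstanceUniforms should call Logger.error when it meets an unsupported type and keep the original declaration (do not strip it).
- [P2] ShaderFactory.ts:289-304 - _scanInstanceUniforms has an edge case in how it handles derived uniforms. When a custom shader declares only a derived uniform (such as uniform mat4 renderer_MVPMat) without declaring renderer_ModelMat, the derived uniform is stripped unconditionally (line 294: if (isDerived) return ""), but because fieldMap is empty, injectInstanceUBO returns early at line 201 and never injects _derivedDefines. The derived uniform is removed with no replacement #define, so shader compilation fails. This path is only reached when isGPUInstance=true (which requires _canBatch to pass), and the standard shaders declare all transform uniforms together through transform_declare, so it is not triggered in production today. As a defensive measure, check before the early return whether any derived uniform was stripped: if there are derived fields but no non-derived ones, the shader's declarations are incomplete, and the code should call Logger.warn and fall back (re-inject the original declarations).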
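The "keep the original declaration for unsupported types" alternative from the first item can be sketched as follows. Everything here is a hypothetical simplification rather than the PR's _scanInstanceUniforms: the regex only matches plain single-line declarations of the form `uniform <type> renderer_<name>;` (no precision qualifiers, arrays, or comments), and `supportedTypes`, `scanRendererUniforms`, and `packBool` are illustrative names.

```typescript
// Types this sketch knows how to place in the instance UBO (bool included, laid out like int).
const supportedTypes = new Set(["bool", "int", "float", "vec2", "vec3", "vec4", "mat4"]);

interface ScanResult {
  source: string;                  // shader source with supported declarations stripped
  fields: Array<[string, string]>; // [name, type] pairs destined for the UBO
  skipped: string[];               // uniforms left untouched because the type is unsupported
}

function scanRendererUniforms(source: string): ScanResult {
  const fields: Array<[string, string]> = [];
  const skipped: string[] = [];
  const pattern = /^[ \t]*uniform\s+(\w+)\s+(renderer_\w+)\s*;[ \t]*$/gm;
  const stripped = source.replace(pattern, (decl: string, type: string, name: string) => {
    if (!supportedTypes.has(type)) {
      skipped.push(name); // a real implementation might log an error here
      return decl;        // keep the declaration so the shader still compiles
    }
    fields.push([name, type]);
    return "";
  });
  return { source: stripped, fields, skipped };
}

// std140 stores a bool in 4 bytes; any non-zero value reads as true.
function packBool(intView: Int32Array, byteOffset: number, value: boolean): void {
  intView[byteOffset >> 2] = value ? 1 : 0;
}
```

With this shape, a declaration is removed only when it is certain to reappear inside the UBO, so an unsupported type degrades to a regular uniform instead of an undeclared identifier.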
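Simplification suggestions
- The special path for modelMat in InstanceBatch.upload (InstanceBatch.ts:71-74): it currently takes the entity.transform.worldMatrix path when propertyId === modelMatId. The reason is that instancing skips _updateTransformShaderData, so propertyValueMap does not contain the latest worldMatrix. Consider whether the batch collection phase (BatcherManager.batch) could trigger one lightweight worldMatrix write (writing only ModelMat into shaderData), so that upload would not need a special path. That would add extra shaderData.setMatrix calls, which could cancel out the gain when the instance count is large, so the current special path is a reasonable trade-off.
- The mat4 wrapper around renderer_NormalMat (ShaderFactory.ts:56): #define renderer_NormalMat mat4(transpose(inverse(mat3(renderer_ModelMat)))) produces a mat4, but the place where it is used (normal_vert.glsl:3) immediately converts it back with mat3(renderer_NormalMat). The compiler will most likely optimize away the mat4→mat3 conversion, but the semantics would be clearer if the define produced a mat3 directly and the use sites received a mat3. This affects compatibility for every shader that references renderer_NormalMat (some may perform mat4 operations), so the change is broad and low priority.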
The overall design direction is right and the implementation quality is high. The P2 issues above do not block merging.
Closes #194
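Summary
- InstanceBatch packs renderer uniforms (ModelMat, Layer, etc.) into a shared std140 UBO
- ShaderFactory.injectInstanceUBO automatically scans the renderer uniforms in a shader and replaces them with UBO array access plus #define remapping
- mat3x4 storage (affine optimization, 48 bytes vs. 64 bytes); derived uniforms (MVMat/MVPMat/NormalMat) are computed on the fly via #define
- MeshRenderer._canBatch/_batch implement the batching decision (same primitive + material + front-face)
- ShaderProgram._recordLocation skips UBO members (location === null), avoiding the creation of unused ShaderUniform objects
Performance
Test scene: 2500 glTF models (Avocado) + 2500 custom-shader cubes, all with dynamic rotation + scale + color animation
Screenshot from an iPhone test run (59 FPS / 21 Draw Calls / 5000 objects):
Future Optimization
injectInstanceUBO obtains renderer uniform information by regex-scanning the GLSL text. If ShaderLab precompilation provided uniform metadata (name, type, group), the regex scan could be replaced with exact splicing, improving robustness and extensibility.
Key Files
- RenderPipeline/InstanceBatch.ts
- RenderPipeline/RenderQueue.ts
- shaderlib/ShaderFactory.ts
- shader/ShaderPass.ts
- shader/ShaderProgram.ts: _instanceLayout field, skips UBO member reflection
- mesh/MeshRenderer.ts: _canBatch/_batch batching logic
- shader/ShaderProgramMap.ts
Test plan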